The accurate staging of liver fibrosis is of paramount importance for determining the state of disease
progression and therapy response, and for optimizing treatment strategies. Non-linear optical microscopy
techniques such as two-photon excitation fluorescence (TPEF) and second harmonic generation (SHG) can
image the endogenous signals of tissue structures and can be used for fibrosis assessment on non-stained
tissue samples. While image analysis of collagen in SHG images has been consistently addressed, the
cellular and tissue information contained in TPEF images, such as inflammation and hepatic cell damage,
is equally important as the collagen deposition imaged by SHG yet remains poorly exploited to date. We
address this situation by experimenting with liver fibrosis quantification and scoring using a combined
approach based on TPEF liver surface imaging of a Thioacetamide-induced rat model and a gradient-based
Bag-of-Features (BoF) image classification strategy. We report the assessed performance results and
discuss the influence of specific BoF parameters on the performance of the fibrosis scoring framework.



The excessive accumulation of newly synthesized extracellular matrix proteins in liver tissue results in
fibrosis, which is the hallmark of chronic liver diseases. Fibrosis progression is closely related to function
failure and neoplastic generation; therefore, monitoring the histo-pathological information connected with
liver fibrosis is necessary for the accurate diagnosis of chronic liver diseases and for establishing appropriate
therapies. Although the routine histological assessment of liver fibrosis based on biopsy samples is invasive and
can be subject to staining variations, sampling errors and inter- and intra-observer discrepancies, it remains
the best standard for fibrosis assessment. Various non-invasive diagnostic tools, such as serum biomarker assays
and liver stiffness measurements, have been reported, but none of them can provide histo-pathological
information at the tissue and cellular level, which is the most direct and convincing evidence for the
diagnosis of liver fibrosis by a pathologist.

Nonlinear microscopy based on intrinsic two-photon excitation fluorescence (TPEF) and second harmonic
generation (SHG) imaging has been demonstrated as a useful tool for the qualitative and quantitative
assessment of various diseases. TPEF/SHG microscopy can considerably enhance the imaging penetration
depth and reduce photobleaching and phototoxicity compared to conventional microscopy. SHG is a nonlinear,
nonresonant and coherent process that plays an important role in tissue imaging because
non-centrosymmetric structures (e.g., collagen) exhibit a nonvanishing second-order susceptibility tensor
χ^(2) that, under the influence of an external electric field, generates a nonlinear optical signal at exactly
half the wavelength of the excitation source. Conversely, the two-photon excitation of molecules is a
nonlinear, resonant and incoherent process that involves the simultaneous absorption of two photons whose
combined energy is sufficient to induce a transition to an excited electronic state. Excited by these two
photons, a fluorophore acts in the same way as if excited by only one photon, emitting a single photon whose
wavelength is determined only by its intrinsic characteristics, such as fluorophore type, chemical structure,
etc. In TPEF, each excitation photon usually carries half of the energy needed to excite the fluorophore, so
the wavelength of the two photons is roughly double the wavelength of the single photon that could have been
used for the same purpose. TPEF/SHG microscopy can be used for imaging endogenous signals of tissue
structures, enabling the assessment of various conditions in non-stained tissue samples, and thus contributes
to avoiding observer errors related to staining variations.
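
The wavelength relations described in this paragraph can be summarized compactly. The symbols below are introduced here for illustration only: \lambda_{ex} is the excitation wavelength, \lambda_{SHG} the wavelength of the generated second-harmonic signal, and \lambda_{1P} the wavelength of a single photon that could excite the same fluorophore.

\[
\lambda_{\mathrm{SHG}} = \frac{\lambda_{\mathrm{ex}}}{2},
\qquad
2\,\frac{hc}{\lambda_{\mathrm{ex}}} \approx \frac{hc}{\lambda_{1P}}
\;\Longrightarrow\;
\lambda_{\mathrm{ex}} \approx 2\,\lambda_{1P}
\]

As a purely illustrative number, an excitation wavelength of 800 nm would place the SHG signal at 400 nm, while the TPEF emission wavelength would still be set by the fluorophore itself rather than by the excitation source.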

TPEF/SHG microscopy has already been employed for the qualitative assessment of liver fibrosis, as
histo-pathology features of liver tissue found in conventional histology slides could be successfully
visualized in TPEF/SHG images. Quantitative assessment of liver fibrosis through image analysis has also
been adopted to complement the conventional qualitative assessment. The advantages of quantitative
assessment include the minimization of intra- and inter-observer variations and the speed at which a
diagnosis can be reached, but most of the available quantitative studies of liver fibrosis are limited to
the use of information from SHG images only. Although the collagen deposition and architecture changes
that can be visualized in SHG images are significant signatures of fibrosis progression, the cellular and
tissue information in TPEF images is equally important, as it covers various relevant aspects such as
hepatic cell inflammation, apoptosis or portal hypertension. The use of TPEF images for quantitative
assessment of liver fibrosis has not yet been studied to its full potential. In a previous study it was
demonstrated that the bile duct cell proliferation area extracted from the TPEF image is a useful indicator
for monitoring fibrosis progression in a bile duct ligation animal model, but this indicator is disease
specific and might not be suited to other liver diseases.

The experiment presented in this study addresses this situation by combining TPEF imaging with a more
general computer vision classification method, Bag-of-Features (BoF), for the purpose of quantification
and automatic classification of liver fibrosis samples in a Thioacetamide (TAA)-induced rat model. BoF
methods are inspired by the Bag-of-Words (BoW) text categorization methods used in information retrieval.
With BoW, a document can be classified as belonging to a particular category based on a normalized
histogram of word counts. BoF methods, also known as Bag-of-Visual-Words, adapt this text categorization
approach to visual categorization by replacing the dictionary of textual words with a dictionary of visual
ones, usually referred to as "visual features". Different BoF methods use different types of visual
features, such as textons, raw image data, invariant descriptors of image patches, descriptors of affine
invariant interest points, or others. To represent an image, BoF uses a histogram that indicates the number
of occurrences in that image of the visual words contained in a dictionary. The dictionary (a.k.a.
codebook) is typically built by running a clustering algorithm over a large set of visual features in order
to divide them into distinct groups and to identify the representative of each group (e.g., the cluster
mean). Given a novel training or test image, visual features are detected in it and assigned their nearest
matching terms from the visual vocabulary, which results in a normalized histogram of the quantized
features detected in the image, called the 'term vector'. The term vector is in practice the image
representation used in BoF strategies. BoF has been successfully used for tasks such as category-level
recognition, object or shape retrieval, content-based image and video retrieval, tracking, pattern mining,
scene classification or biomedical X-ray image classification. Among the reasons for which this technique
has attracted great attention in recent years are its simplicity, its effectiveness and its modular
structure, which makes it easily adaptable to a wide range of applications in various fields. In the past
decade BoF has been introduced to the field of histopathology image classification but, to the best of our
knowledge, it has been used neither for the classification of fibrosis stages in liver images, which is the
subject of the experiment that we present in this paper, nor in association with the imaging method that we
use, TPEF.
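
The codebook and term-vector steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the feature vectors are random toy data, plain Lloyd k-means stands in for the clustering step, and the function names build_codebook and term_vector are hypothetical.

```python
import numpy as np

def build_codebook(features, k, n_iter=20, seed=0):
    """Cluster feature vectors with plain Lloyd k-means; the k centroids are the visual words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=k, replace=False)
    centroids = features[start].astype(float)
    for _ in range(n_iter):
        # squared Euclidean distance from every feature to every centroid
        d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the previous centroid if a cluster becomes empty
                centroids[j] = members.mean(axis=0)
    return centroids

def term_vector(features, centroids):
    """Normalized histogram of nearest-visual-word assignments for one image."""
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

# Toy data (assumption): 500 vocabulary features and 60 features from one image, 128-D each
rng = np.random.default_rng(1)
codebook = build_codebook(rng.normal(size=(500, 128)), k=10)
tv = term_vector(rng.normal(size=(60, 128)), codebook)
print(tv.shape, tv.sum())
```

The term vector has one entry per visual word, so its length equals the codebook size regardless of how many features were extracted from the image.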

Our results demonstrate the utility of TPEF imaging for the quantitative assessment of liver fibrosis, and
show that a gradient-based BoF strategy can exploit the TPEF image content variations connected to the
cellular and tissue structure changes associated with fibrosis progression for diagnostic purposes.

The importance of this experiment is closely connected to the potential in vivo application of a TPEF/SHG
endoscope for liver surface scanning. The parallel use of such a tool could considerably increase the level
of information that is currently collected during a liver biopsy intervention, and could represent a key
tool for patients who cannot be subjected to liver biopsy due to various medical conditions. Since the
liver surface is surrounded by a thick collagen layer called the Glisson's capsule, SHG signals attenuate
significantly in sub-capsule regions. Thus SHG imaging alone cannot provide enough information when the
liver surface is scanned. Hence, studying the suitability of TPEF image analysis for fibrosis assessment is
of great importance to future TPEF/SHG endoscopy applications.

Besides illustrating the potential of TPEF liver surface imaging for quantitative liver fibrosis
assessment, the experiment that we present highlights the way specific BoF parameters can influence the
classification performance for the addressed problem. We aimed at reaching a better understanding of the
mechanisms that influence the BoF classification of TPEF liver fibrosis images by experimenting with
specific BoF parameters: the spacing of the grid on which the features are extracted, the scale of the
patch around each grid point that contributes to its descriptor, and the size of the codebook. We consider
it important to present our findings on the influence of the BoF parameters on the classification
performance because, to date, this is the first study to combine TPEF imaging with BoF, and the first
experiment that deals with the classification of TPEF images by exploiting the potential of Scale Invariant
Feature Transform (SIFT) descriptors.

Results 

Qualitative assessment of TPEF/SHG images from liver surface. 

The liver samples were imaged in the reflective mode of TPEF/SHG microscopy up to 70 µm from the liver
surface. The SHG signals of the Glisson's capsule at the liver surface are strong but attenuate rapidly in
the sub-capsule region, as shown in Figure 1A, whereas the TPEF signals attenuate more slowly and are much
stronger than the SHG signals in the sub-capsule regions, and can therefore provide more detailed tissue
information. The SHG image of the Glisson's capsule as well as the TPEF and SHG images with the highest
signal intensity in the sub-capsule region are illustrated in Figure 1B. The cellular structures can be
clearly observed in the TPEF image and are complementary to the collagen information in the SHG image.
Sub-capsule TPEF and SHG images of liver tissue samples from five fibrotic stages according to the Metavir
scoring system are exemplified in Figure 1C; cellular morphology changes can be observed along fibrosis
progression in the TPEF images, whereas changes of collagen structures are not obvious due to the low
signal intensity in the SHG images.

Quantitative assessment of TPEF images from liver sub-capsule region. Below we present the quantitative
assessment results that we obtained by using the DSIFT-BOF strategy presented in the Methods section. The
impact of three DSIFT-BOF parameters (grid spacing, bin size and codebook size), which are associated with
the spatial and architectural information of the tissue morphology, was investigated for five classification
scenarios. The classification of all fibrosis stages is important for the prognosis of fibrosis progression
and for establishing optimal treatment therapies, and for this reason one of the problems that we addressed was
Figure 1 | SHG & TPEF imaging on the liver surface. (a) SHG & TPEF signal intensity at the liver surface with respect to the position of the Glisson's capsule;
(b) SHG image collected in the Glisson's capsule region and highest signal intensity TPEF and SHG images collected in the sub-capsule region. The
cellular structures that can be observed in the TPEF image are complementary to the collagen information in the SHG image. (c) Pairs of TPEF & SHG
images of fibrotic liver tissue from Stage 0 to Stage 4 collected in the sub-capsule region. The TPEF images show that in the normal liver the hepatocytes
are well aligned along the sinusoidal spaces. In the course of fibrosis progression, this alignment is destroyed, with enlarged nuclei size and nuclei to cell
ratio. The tissue structure becomes disorganized, with larger empty spaces occupied by deposited collagen. The collagen structures are not obvious in the
corresponding SHG images due to the low signal intensity. Field-of-view size is 450 µm × 450 µm.



the classification between all five distinct fibrosis stages: Stage 0 vs. Stage 1 vs. Stage 2 vs. Stage 3
vs. Stage 4 (S0_S1_S2_S3_S4). Additionally, we evaluate the performance of the DSIFT-BOF framework in
predicting specific endpoints in fibrosis progression, a task that has a major impact on clinical planning.
For example, the prediction of significant fibrosis (stages 2-4) versus non-significant fibrosis (stages
0-1) is critical for assessing the need for antiviral therapies, while the detection of cirrhosis (stage 4)
versus non-cirrhosis (stages 0-3) is an important indicator for the end stage of fibrosis progression, which
is associated with a higher risk of developing liver cancer such as hepatocellular carcinoma. Specific
endpoint prediction is evaluated in four binary classification scenarios: Stage 0 vs. Stages 1, 2, 3, 4
(S0_S1234); Stages 0, 1 vs. Stages 2, 3, 4 (S01_S234); Stages 0, 1, 2 vs. Stages 3, 4 (S012_S34); and
Stages 0, 1, 2, 3 vs. Stage 4 (S0123_S4).
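
The four binary scenarios amount to different groupings of the five stages. A minimal sketch of that grouping follows; treating the higher-stage group as the positive class is an assumption made here for illustration, and the helper name binary_label is hypothetical.

```python
# Stages that form the higher-stage group in each binary scenario.
# Labelling this group as "positive" (1) is an assumption of this sketch.
SCENARIOS = {
    "S0_S1234": {1, 2, 3, 4},
    "S01_S234": {2, 3, 4},
    "S012_S34": {3, 4},
    "S0123_S4": {4},
}

def binary_label(stage, scenario):
    """Return 1 if the stage (0-4) belongs to the higher-stage group, else 0."""
    if stage not in range(5):
        raise ValueError("stage must be an integer from 0 to 4")
    return int(stage in SCENARIOS[scenario])

# Stage 2 counts as significant fibrosis but not as cirrhosis
print(binary_label(2, "S01_S234"), binary_label(2, "S0123_S4"))
```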

We evaluate the classification performance of DSIFT-BOF for the five fibrosis classification scenarios
mentioned above in terms of the area under the Precision-Recall (PR) curve, the PR-area. In a binary
decision problem a sample can be classified as either positive or negative. The decision of the employed
classifier can be represented in a structure known as a confusion matrix, which consists of four categories:
True Positives (TP), samples that are correctly labeled as positive; False Positives (FP), samples that are
incorrectly labeled as positive; and, similarly, True Negatives (TN) and False Negatives (FN). The confusion
matrix can be used to construct the points of both the PR and the Receiver Operating Characteristic (ROC)
spaces. In PR space, Recall, a.k.a. Sensitivity, is plotted on the x-axis, while Precision, a.k.a. Positive
Predictive Value, is plotted on the y-axis. Eq. 1 and Eq. 2 define Precision and Recall, respectively. We
generated PR curves by varying the value of the classification criterion as described in the Methods
section. The areas under the PR curves were calculated by trapezoidal approximation. The relationships
between the PR and ROC curves are very well described in (Davis and Goadrich, 2006).



\[
\mathrm{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{1}
\]

\[
\mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{2}
\]
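
The following sketch shows how Precision and Recall points can be generated and integrated with the trapezoidal rule. It rests on assumptions that are not taken from this paper: the classification criterion is modelled as a threshold swept over a per-sample score, score ties are not merged, at least one positive sample is present, and the curve is anchored at (Recall = 0, Precision = 1), which is one common convention.

```python
import numpy as np

def precision_recall_points(scores, labels):
    """Sweep a threshold from the highest to the lowest score; labels are 1 (positive) or 0."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranked = labels[np.argsort(-scores, kind="stable")]
    tp = np.cumsum(ranked)        # true positives when the top-i samples are called positive
    fp = np.cumsum(1 - ranked)    # false positives at the same cut
    precision = tp / (tp + fp)    # Eq. 1
    recall = tp / ranked.sum()    # Eq. 2, since TP + FN is the number of positives
    # Anchor the curve at recall 0 with precision 1 (a convention, assumed here)
    return np.concatenate(([0.0], recall)), np.concatenate(([1.0], precision))

def pr_area(recall, precision):
    """Area under the PR curve by trapezoidal approximation."""
    widths = np.diff(recall)
    heights = (precision[1:] + precision[:-1]) / 2.0
    return float(np.sum(widths * heights))

# Toy scores (assumption): a perfectly ranked set and an imperfect one
perfect = pr_area(*precision_recall_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
mixed = pr_area(*precision_recall_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
print(perfect, mixed)
```

Under these conventions a classifier that ranks every positive above every negative reaches a PR-area of 1, and any ranking error lowers the area.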



The DSIFT-BOF algorithm (Fig. 2) that we experimented with depends on three variable parameters: the grid
spacing, the bin size and the codebook size. The grid spacing is the distance in pixels between extracted
features, while the bin size refers to the dimension of the SIFT bin. A schematic illustration of these two
parameters is shown in Figure 3. The codebook size is the number of visual words
Figure 2 | Schematics of the DSIFT-BOF framework. A codebook feature space is created by extracting DSIFT descriptors at fixed grid locations from all
the images that are dedicated to vocabulary building. A codebook is generated by running a clustering algorithm that partitions the codebook feature
space into k regions. The centroids of these regions (a.k.a. clusters) represent the codebook terms. The codebook allows for BoF representations, term
vectors, to be assigned to training and test images. The term vector is a histogram that indicates the number of occurrences of the codebook terms in an
image. Term vectors are assigned to the training images, and are then used as ground truth data by a classifier. An image is tested by assigning a
term vector to it, and running this term vector through a classifier that uses the term vectors of the training images to indicate its fibrosis stage. The
classifier that we use in this experiment is weighted k-NN.



Figure 3 | Illustration of grid spacing and DSIFT feature extraction. For consistency reasons we extract the same number of features from all vocabulary,
training or test images. The image locations from where the features are extracted are fixed according to a grid. A sparse grid, equivalent to a low number
of features, can be responsible for dismissing important image information, while a dense grid, equivalent to a high number of features, is computationally
demanding and may lead to redundant information. The DSIFT descriptors extracted from the grid locations are histogram representations that combine
local gradient orientations and magnitudes from a neighborhood around a keypoint, indicated by the bin size. More precisely, the descriptor is a
histogram of gradient location and orientation, where location is quantized into a 4 × 4 location grid and the gradient angle is quantized into 8
orientations, one for each of the cardinal directions. The resulting descriptor is a normalized vector with the dimension of 128 elements.
in the dictionary (referred to throughout the paper as 'codeblocks') employed by the BoF classification
framework.
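
To make the roles of the bin size and the 4 × 4 × 8 layout concrete, the sketch below computes a simplified dense-SIFT-style descriptor at one grid point. It is an illustration under explicit simplifications, not the descriptor implementation used in this study: gradients are hard-assigned to orientation bins, no Gaussian weighting or interpolation is applied, the patch is assumed to be a square of 4 × bin_size pixels per side centred on the grid point, and the function name dsift_descriptor is hypothetical.

```python
import numpy as np

def dsift_descriptor(image, cx, cy, bin_size):
    """Simplified 128-D descriptor: 4 x 4 spatial bins, 8 orientation bins each."""
    height, width = image.shape
    half = 2 * bin_size  # the patch spans 4 bins per side (assumption of this sketch)
    if cx - half < 0 or cy - half < 0 or cx + half > width or cy + half > height:
        raise ValueError("patch does not fit inside the image")
    gy, gx = np.gradient(image.astype(float))   # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    orientation = np.floor(angle / (2 * np.pi / 8)).astype(int) % 8
    descriptor = np.zeros((4, 4, 8))
    for by in range(4):
        for bx in range(4):
            y0 = cy - half + by * bin_size
            x0 = cx - half + bx * bin_size
            m = magnitude[y0:y0 + bin_size, x0:x0 + bin_size].ravel()
            o = orientation[y0:y0 + bin_size, x0:x0 + bin_size].ravel()
            # magnitude-weighted orientation histogram of this spatial bin
            descriptor[by, bx] = np.bincount(o, weights=m, minlength=8)
    descriptor = descriptor.ravel()
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

# Toy image (assumption): random intensities, one grid point at the centre
rng = np.random.default_rng(0)
image = rng.random((64, 64))
descriptor = dsift_descriptor(image, cx=32, cy=32, bin_size=6)
print(descriptor.shape)
```

A real pipeline would compute the gradient images once and reuse them for every grid point; they are recomputed per call here only to keep the sketch self-contained.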

The values of these three parameters that we used in our experiment are presented in Table 1. Running
DSIFT-BOF for one combination of the three parameters is referred to throughout the paper as a "scenario".
The results that we present are based on 168 scenarios, which is the total number of possible combinations
of the tested parameter values (4 grid spacings × 6 bin sizes × 7 codebook sizes). The influences of grid
spacing, bin size and codebook size on the achievable performance are presented in the following sections.
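
The scenario count can be checked by enumerating the parameter grid of Table 1. The snippet below is only a bookkeeping sketch; the variable names are chosen here for illustration.

```python
from itertools import product

# Parameter values listed in Table 1
GRID_SPACINGS = (10, 20, 40, 60)
BIN_SIZES = (2, 4, 6, 8, 10, 12)
CODEBOOK_SIZES = (50, 250, 500, 750, 1000, 1250, 1500)

# One scenario per (grid spacing, bin size, codebook size) combination
scenarios = list(product(GRID_SPACINGS, BIN_SIZES, CODEBOOK_SIZES))
per_grid_spacing = sum(1 for g, b, c in scenarios if g == 10)
per_bin_size = sum(1 for g, b, c in scenarios if b == 6)
per_codebook_size = sum(1 for g, b, c in scenarios if c == 50)
print(len(scenarios), per_grid_spacing, per_bin_size, per_codebook_size)  # 168 42 28 24
```

The three per-parameter counts match the number of scenarios averaged for each grid spacing, bin size and codebook size in the sections that follow.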

Grid spacing influence on the DSIFT-BOF performance. Grid spacing determines the density of the extracted
features and hence the total number of features extracted from an image (Fig. 3). The use of various grid
spacings has been reported in the literature; for example, Lazebnik et al. reported using SIFT descriptors
extracted from a dense grid with a spacing of 8 pixels, and Tamaki et al. evaluated the influence of using
grid spacings of 5, 10 and 15 pixels. In general, a smaller grid spacing performs better as it generates
more features, but this advantage comes at the cost of increased computation time required for clustering,
training and classification. We tested regular grids with 10, 20, 40 and 60 pixels spacing. The number of
features extracted from an image of 1024 × 1024 pixels by using a grid spacing of 10 pixels is 36 times
higher than when using one of 60 pixels (Supplementary Table 1).
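
The factor of 36 follows from simple grid arithmetic. The sketch below assumes that the number of grid points per axis is the image side divided by the spacing, rounded down; this rule is an assumption, chosen because it reproduces the per-image feature counts quoted in this section for spacings of 10 and 60 pixels, and the exact grid layout used in the study may differ.

```python
def features_per_image(side, spacing):
    """Number of grid points on a square image, assuming floor(side / spacing) points per axis."""
    per_axis = side // spacing
    return per_axis ** 2

dense = features_per_image(1024, 10)    # 102 points per axis
sparse = features_per_image(1024, 60)   # 17 points per axis
print(dense, sparse, dense // sparse)   # 10404 289 36
```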

The values of the achieved PR-areas for the 'S0_S1_S2_S3_S4' classification scenario are illustrated in
Figure 4a. For each grid spacing we evaluated 42 scenarios (7 codebook sizes, 6 bin sizes). For all
evaluated grid spacings (10, 20, 40 and 60 pixels), the best classification in terms of PR-area is observed
for Stage 0 images, and the worst classification is observed for Stage 1 images. As expected, the highest
mean PR-area value is observed for a grid spacing of 10 pixels, which is equivalent to 10,404 features per
image, while the lowest PR-area, 52% lower than the maximum, is observed for the highest grid spacing, 60
pixels, which is equivalent to 289 features per image. Although the differences between the minimum and
maximum mean PR-area values are smaller, the same trend can also be observed for the four binary
classification scenarios evaluated (Fig. 4b-e). We observe a 15% decrease in the case of S0_S1234, 5% for
S01_S234, 22% for S012_S34 and 12% for S0123_S4. 'S01' and 'S012' exhibit a different dependence on the
grid spacing than the other evaluated binary classes, with the PR-area slightly increasing for higher
grid-spacing values.

Bin size influence on the DSIFT-BOF performance. One of the important parameters of a descriptor-based BoF
method is the dimension of the patch around a keypoint that contributes to its descriptor. In the Scale
Invariant Feature Transform (SIFT) method, the dimension of this patch derives from the size of the bins
(Fig. 3), and is related to the SIFT keypoint scale in the Gaussian Scale Space (GSS) by a multiplier. The
DSIFT features used in our framework are not assigned a scale, as they are not extracted by using SIFT's
Difference-of-Gaussian (DoG) detector but from fixed locations corresponding to a grid. We adopt the SIFT
concept of correlating the bin size with the GSS, and use different representations of the image in the GSS
for different bin sizes. The GSS


Table 1 | The values of the evaluated BoF parameters

Parameter               Values
Grid spacing (pixels)   10, 20, 40, 60
Bin size (pixels)       2, 4, 6, 8, 10, 12
Codebook size           50, 250, 500, 750, 1000, 1250, 1500



representations of the image are obtained by convolving it with isotropic Gaussian kernels of different
standard deviations (Supplementary Fig. 1).
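
A minimal sketch of how such smoothed representations can be produced is given below. The standard deviations used in the example are hypothetical, since this section does not state the multiplier that links bin size to smoothing level; the kernel truncation at three standard deviations and the reflective border handling are likewise assumptions of the sketch.

```python
import numpy as np

def gaussian_smooth(image, sigma):
    """Convolve an image with an isotropic Gaussian kernel, applied separably along rows and columns."""
    radius = int(np.ceil(3 * sigma))           # truncate the kernel at 3 sigma (assumption)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()                     # unit gain, so flat regions keep their value
    padded = np.pad(image.astype(float), radius, mode="reflect")

    def conv(v):
        return np.convolve(v, kernel, mode="valid")

    rows = np.apply_along_axis(conv, 1, padded)   # smooth along x
    return np.apply_along_axis(conv, 0, rows)     # smooth along y

# Toy image and hypothetical standard deviations, one smoothed instance per value
rng = np.random.default_rng(0)
image = rng.random((64, 64))
scale_space = {sigma: gaussian_smooth(image, sigma) for sigma in (1.0, 2.0, 3.0)}
print({sigma: smoothed.shape for sigma, smoothed in scale_space.items()})
```

Each smoothed instance keeps the size of the input image, so descriptors can be extracted at the same grid locations from any of them.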

In each particular scenario, the dimension of the patch is the same for all grid keypoints and is directly
related to the bin size and smoothing level. We evaluated six different bin sizes (2, 4, 6, 8, 10 and 12
pixels) corresponding to different smoothed instances of the image, and observed how this influences the
DSIFT-BOF classification results on TPEF images of liver fibrosis. Previous reports have proposed using
features of different scales simultaneously (within the same run), or including information originating at
different scales in the same descriptor. We chose, however, to use features of the same scale within a
particular scenario in order to gain a better understanding of the influence of the scale on the results.

Figure 5a presents the achieved PR-areas in the case of S0_S1_S2_S3_S4. For each bin size, we evaluated 28
scenarios (7 codebook sizes, 4 grid spacings). For all evaluated bin sizes (2, 4, 6, 8, 10 and 12 pixels),
the best classification in terms of PR-area is observed for Stage 0, while the worst classification is
observed for Stage 1. The highest mean PR-area is observed for a bin size of 6 pixels, while the lowest mean
PR-area, 62% lower than the maximum, is observed for a bin size of 2 pixels. A similar dependence on the bin
size can also be observed for the four binary classification scenarios tested (Fig. 5b-e). A bin size of 6
pixels provides the maximum mean PR-area in the case of S012_S34 and S0123_S4, while the position of the
maximum PR-area shifts to a bin size of 8 pixels for S0_S1234 and S01_S234. For all classification scenarios
evaluated we observe a consistent rise in performance between bin sizes of 2 and 6 pixels, and small
differences between bin sizes ranging from 6 to 12 pixels. For all evaluated scenarios a bin size of 2
pixels provides the worst results.

Codebook size influence on the DSIFT-BOF performance. The BoF representation of an image (a.k.a. 'term
vector') consists of a histogram of the visual words defined in a codebook (visual dictionary) that can be
found in it, as described in the introductory section. The codebook is built by using a clustering (vector
quantization) algorithm, which in our experiment is K-means. An important parameter to be decided before
commencing clustering is the number of codeblocks (a.k.a. visual words) that the dictionary contains. The
choice of a particular value for this parameter depends on the type and content of the images to be
classified, as the codeblocks represent key image content components. Using fewer codeblocks has the
advantage of potential higher discriminative power, while using more codeblocks has the advantage of
potential higher sensitivity. In most applications, a higher number of visual words yields better
discrimination between classes, at the expense of the higher computational power needed for clustering,
which is directly related to the size of the dictionary. Therefore, when deciding upon a dictionary size for
a specific application, one should identify the optimum tradeoff between accuracy and computational
efficiency. Even though a larger dictionary provides better results in most applications of natural image
classification, the size of the dictionary was found not to be particularly important in a medical image
classification task. Caicedo et al. also report that their SIFT-based codebook required fewer codeblocks to
express all the different patterns in the histo-pathological image collection that they tested, and claim
their results to be consistent with the rotation and scale invariance properties of the SIFT descriptor. In
this section we present our experiments on the influence of the codebook size on the DSIFT-BOF
classification of TPEF liver fibrosis images. Seven codebook dimensions (50, 250, 500, 750, 1000, 1250 and
1500) are evaluated.

Figure 6a presents the codebook size influence on the 
S0_S1_S2_S3_S4 classification scenario. For each codebook size, 
we have evaluated 24 scenarios (6 bin sizes, 4 grid spacings). As in 
Figure 4 | Influence of the grid spacing on liver fibrosis classification performance by DSIFT-BOF. (a) Stage 0 vs. Stage 1 vs. Stage 2 vs. Stage 3 vs. Stage
4; (b) Stage 0 vs. Stages 1,2,3,4; (c) Stages 0,1 vs. Stages 2,3,4; (d) Stages 0,1,2 vs. Stages 3,4; (e) Stages 0,1,2,3 vs. Stage 4. Grid spacing values refer
to the number of pixels between feature locations.



the previous classification experiments, for each of the evaluated codebook sizes the best classification
results are observed for Stage 0 images, while the worst results are observed for Stage 1 images. The
highest mean PR-area is observed for a codebook size of 1500, which is 37% higher than the worst case, that
of a 50-element codebook. The mean PR-area differences between codebooks of 750, 1000, 1250 and 1500
elements are below 1%.

Except for the S012_S34 scenario, the PR results for the binary classification scenarios illustrate a
similar trend, with performance increasing with codebook size (Fig. 6b-e). The maximum value of the mean
PR-area is achieved for codebook dimensions of either 1250 or 1500 elements, with approximately 1%
difference between the two cases. The minimum mean PR-area always occurs for the lowest codebook dimension,
50 elements. The differences between the maximum and minimum PR-area values range from 6% in the case of
S0123_S4 to 16% and 20% for S0_S1234 and S01_S234, respectively. In the case of S012_S34 the maximum mean
PR-area value is achieved for a codebook dimension of 500 elements, and is 10% higher than the minimum
value.



Summary of overall results. The mean PR-area values for all five classification scenarios evaluated, as
well as the BoF configurations that yield the minimum and maximum PR-areas for each of the evaluated
scenarios, are presented in Table 2.

Discussion 

In the presented experiment we evaluated a SIFT-based BoF framework, DSIFT-BOF, with respect to the
potential of BoF methods for classifying liver fibrosis images collected by TPEF imaging. To the best of our
knowledge this is the first experiment to combine TPEF imaging with a 'Bag-of-Features' image classification
strategy, and the first experiment to present an approach for the quantitative evaluation of liver fibrosis
based on TPEF images collected from the liver surface.

The work was aimed at exploiting TPEF data collected on the liver surface for the purpose of assessing
fibrosis stages and specific endpoints, at reaching a better understanding of how specific BoF parameters
influence the classification performance for the addressed problem, and at introducing BoF to the fields of
TPEF
Figure 5 | Influence of the bin size on liver fibrosis classification performance by DSIFT-BOF. (a) Stage 0 vs. Stage 1 vs. Stage 2 vs. Stage 3 vs. Stage 4; (b)
Stage 0 vs. Stages 1,2,3,4; (c) Stages 0,1 vs. Stages 2,3,4; (d) Stages 0,1,2 vs. Stages 3,4; (e) Stages 0,1,2,3 vs. Stage 4. Bin size values, given in pixels,
refer to the size of the regions that contribute to the bin histograms, which constitute the DSIFT descriptor.



imaging and liver fibrosis assessment. We assessed the classification performance of the framework for the
demanding five-class problem (S0_S1_S2_S3_S4) and for four binary classification scenarios that are
important for specific endpoint prediction: non-fibrosis versus fibrosis (S0_S1234), non-significant
fibrosis versus significant fibrosis (S01_S234), mild fibrosis versus severe fibrosis (S012_S34) and
non-cirrhosis versus cirrhosis (S0123_S4). The best results in terms of mean PR-area are observed for the
S012_S34 scenario, while the worst results are observed, as expected, for the S0_S1_S2_S3_S4 scenario, which
is generally considered a difficult classification scenario (Table 2). Taking into account that this
experiment represents a first attempt in many regards, the overall results that we obtained (Fig. 4-6,
Table 2) are promising and illustrate the potential of using TPEF imaging on the liver surface as a
diagnostic method. The results also reveal that combining gradient-based BoF methods with TPEF imaging and
using BoF strategies for quantitatively assessing liver fibrosis stages holds great potential.

The three BoF parameters to which we have focused our attention 
on are grid spacing, which is equivalent to the number of features 



extracted per image, bin size, and codebook size. The influence of 
different grid spacings on the performance of the DSIFT-BOF frame- 
work were found to be consistent for the majority of the classification 
scenarios evaluated, and propose a high density grid as the solution to 
be preferred despite the higher computational time implied. The 
worst performances of the classification framework were found to 
occur when using a bin size of 2 pixels, while best were observed for 
bin size values of 6 and 8 pixels This situations occurs due to the fact 
that in the case of low bin sizes neighbor gradients are highly corre- 
lated and are very likely to hit the same orientation, so the chance of 
having orientation bins equal to zero is significant. Such a situation 
prevents the full dimensionality of the descriptor from being 
exploited, since a considerable amount of '0' elements will occur, 
yielding reduced specificity. Increasing the size of the patch that 
contributes to the descriptor reduces the occurrence of '0' values in 
the descriptor, making it more discriminative. Our experiments on 
codebook size influence show that the classification performance is 
dependent on the codebook size only up to a point. The classification
performance improvements are consistent for most scenarios when
the codebook size is increased from 50 to 750 elements, but the PR-area
differences observed between high dimensional codebooks (e.g.
1000, 1250, and 1500) are very low. As higher codebook size means
higher computational demand, we can think of a codebook of 1000 
elements as providing an optimal 'computational time/performance' 
ratio. The minimum tested codebook dimension (50 elements) per- 
forms worst in all scenarios, but since the computational demand for 
this reduced number of clusters is considerably lower than in the 
other cases (over ten times lower than in the case of the highest tested 
codebook size - 1500 elements), it could still be considered an option 
for real-time, online or mobile applications. The most computationally
expensive stage of the suggested DSIFT-BOF algorithm is the k-means
clustering procedure. This stage aims to partition n observations
(in our case n is the sum of all features extracted from all vocabulary
images) into k clusters, with each observation belonging to the
cluster with the nearest mean. This problem is computationally difficult:
it is NP-hard in a general Euclidean space of dimension d, even for 2
clusters. If k and d are fixed, the problem can be solved exactly in time
$O(n^{dk+1} \log n)$*^. In our experiment we use seven codebook dimensions
(codebook dimension = k) and the observations are 128-dimensional. A
higher PR-area was observed in the binary classification scenarios for
the classes containing more stages (e.g. non-cirrhosis). Besides being
related to image content, this situation also has a statistical
cause that influences the weighted k-NN strategy
used for classification. If more fibrosis stages correspond to a class,
and the number of associated training images for each class is pro- 
portional to the number of fibrosis stages (like in the case of our 
implementation), there is a higher probability for the k-NN clas- 
sification criterion to be met for a sample belonging to that particular 
class. In consequence, in such a situation the classes containing more 
fibrosis stages are privileged in comparison to the others. 

The presented BoF based framework is a modular one, and enhan- 
cing any of its components, independently of the others, leads to an 
enhancement of the overall results. Our future work in this field will 
focus on enhancing the method by implementing various modifica- 
tions to the algorithm, such as including orientation and spectral 
information in the descriptor, using other classifiers more complex 
than k-NN (such as support vector machines), optimizing the num- 
ber and the type of images used for vocabulary building and for 
training, or enabling the automatic selection of optimal BoF
parameter configurations. At the same time, our future work aims at
developing an advanced iterative approach which will exploit different
classifiers, such as Naive Bayes, for processing the data resulting
from iteratively running a number of BoF scenarios, or even employing
methods that combine different classifiers, such as bagging,
boosting or stacking'**'. Finally, another important research direction
for the future is designing and exploiting mixed codebooks containing
both 2D and 3D features, such as volumetric descriptors*^, specific
3D morphological information*"*, fractal measures (e.g. fractal
dimension, fractal lacunarity) or spectral information.

The presented experiment brings evidence that the quantification 
of cellular and tissue information in TPEF images collected from the 
liver surface is equally important to the characterization of collagen 
deposition in deeper liver tissue sections by SHG for the assessment 
of liver fibrosis. We consider this finding extremely important since 
the liver tissue information provided by the two imaging techniques
is complementary, which means that combining the two would
yield higher diagnostic sensitivity and specificity. While the main
emphasis has so far been placed on the analysis of SHG images, we
have demonstrated the potential use of TPEF imaging for the quan- 
tification and automated diagnosis of liver fibrosis. 

Besides the fact that to date TPEF data on liver samples has not yet 
been exploited to its full potential with respect to the liver fibrosis
assessment problem, another reason that has motivated our experi- 
ment is the possibility to easily extend a TPEF image based algorithm 
to be used with fluorescence data collected by conventional widefield 
or confocal microscopy/endomicroscopy. While the availability of 
TPEF/SHG capable systems is still limited mostly due to the prohib- 
itive costs of femtosecond laser sources, conventional widefield or 
confocal fluorescence capable systems are available in most institu- 
tions where biomedical research is conducted. 

Another important aspect of this experiment is that the TPEF
images used in this study were collected from the liver surface,
which demonstrates the potential application of liver surface
scanning with nonlinear endomicroscopy"*^"^'. Such techniques 
could replace in some fibrosis assessment scenarios the more invasive 
liver biopsy, or could be used in a parallel association with liver 
biopsy for maximizing the level of information that is collected dur- 
ing an intervention. 



The combined TPEF - BOF classification framework proposed in 
this study provides promising results, and thus holds significant 
potential with respect to the liver fibrosis assessment problem. At the
same time, as multi-photon imaging of tissues and cells is becoming a
widely used method to study different medical diseases and condi- 
tions^^, the proposed framework could represent a consistent solu- 
tion for other diagnostic scenarios such as TPEF based 
differentiation between normal, inflammatory and neoplastic lung"', 
normal and cancerous gastric tissues^"* or normal, benign, and cancer 
affected breast tissues^^. The influence of the three DSIFT-BOF para- 
meters that we have evaluated in our experiment is directly con- 
nected to the image content in terms of tissue morphology, and for 
this reason the presented results are mainly relevant for the 
addressed problem: TPEF based liver fibrosis diagnosis. Nevertheless,
TPEF images collected on different types of mammalian tissues,
including human tissues, share common characteristics related to the
contrast mechanism, and for this reason this study could potentially
impact other similar classification experiments that combine TPEF, 
Bag-of-Features and gradient based descriptors. 

Methods 

Imaging setup. The non-linear optical microscope used in the experiment for Two- 
photon Excited Fluorescence (TPEF) data acquisition was based on a confocal 
imaging system (LSM 510 Meta, Carl Zeiss, Jena, Germany) coupled to an external
tunable mode-locked Ti:Sapphire laser (Mai-Tai broadband, Spectra-Physics,
USA)^^. The laser line was tuned to a 900 nm wavelength and routed by a dichroic
mirror (reflect > 700 nm, transmit < 543 nm), through an objective lens
(Plan-Neofluar, 20X, NA = 0.5, Carl Zeiss, Jena, Germany) to the tissue specimen. TPEF
signals were collected by the same objective lens in the epi-mode, passing through the
dichroic mirror (reflect < 490 nm, transmit > 490 nm) and a 500-550 nm
band-pass (BP) filter, before being recorded by a photomultiplier tube (PMT, Hamamatsu
R6357, Tokyo, Japan).

Sample preparation. 40 male Wistar rats were used in this study. 35 rats were treated
with thioacetamide (TAA), an organosulfur compound with the formula C2H5NS
which is known to produce marked hepatotoxicity in exposed animals. 200 mg/kg of
TAA were administered by intraperitoneal injection three times a week for up to 14
weeks to induce liver fibrosis. The Wistar rats were sacrificed at time-points of 2, 4, 6, 8,
10, 12 and 14 weeks (n = 5 per time point). Another 5 rats were also sacrificed at week
0, without treatment, as the control group. Cardiac perfusion with 4% 
paraformaldehyde was performed to flush out blood cells and the liver was fixed in 
formalin before harvesting.

After harvesting, the entire left lobe of each rat liver was placed on the microscope
stage for imaging, and at random sites on the anterior surface of each liver sample
TPEF images were collected at multiple depths (z-stacks). For each random site, a
representative 2D image to be used in the presented experiment was selected from the
corresponding z-stack by using an automatic reference frame estimator^^ that relies
on image brightness, contrast and sharpness. The dimension of the field of view was
450 µm by 450 µm, and the resolution of the images was 1024 × 1024 pixels.

After performing TPEF imaging, 5 µm thick liver slices were sectioned from the
liver lobe and stained with a Masson Trichrome (MT) stain kit (ChromaView advance
testing, #87019, Richard-Allan Scientific) for fibrosis scoring by an experienced
pathologist using the Metavir system^^. This system assesses histologic lesions in
hepatitis using two separate scores, one for necroinflammatory grade and another for
the stage of fibrosis (Stage 0, no fibrosis; Stage 1, portal fibrosis without septa; Stage 2,
portal fibrosis with rare septa; Stage 3, numerous septa without cirrhosis; Stage 4,
cirrhosis).

DSIFT features. During the past decade strong emphasis has been placed on the 
detection and description of affine-invariant regions^*''^^"^^ as numerous computer 
vision applications are based on image feature extraction and matching. Among various 
methods reported in the literature, the Scale-Invariant Feature Transform (SIFT)^^
became one of the most preferred choices because of its high accuracy^", relatively low
computation time and the availability of open-source implementations. The original
SIFT technique provides solutions for both the detection and the description of image
keypoints but we have previously shown that sparsely detecting image keypoints by the
Difference-of-Gaussian (DoG) method^^ proposed in SIFT is influenced by specific
acquisition parameters of Laser Scanning Microscopy (LSM) such as photomultiplier 
amplification and laser beam power^^. This means that the number of keypoints that 
SIFT can automatically detect in LSM images, including TPEF images, can be highly 
different when these are collected under distinct acquisition configurations. The same 
situation has been observed^^ to take place as well in the case of Speeded-up Robust 
Features^^, another popular gradient based feature detection/description technique. In 
order to avoid BoF related problems that could occur due to these aspects, such as 
unbalanced dictionaries or inconsistent term vectors, we have chosen to use a grid 
approach instead of a feature-detection one, as similar grid based strategies were 
reported to perform better than feature-detection based strategies in other 
experiments^"'^^'^". In a grid based approach the same number of features is extracted 
from all images from fixed x,y coordinates imposed by a grid, as illustrated in Fig. 3. 

The visual features that we have used in our experiment are Dense-SIFT (DSIFT)
features*'-*^, a SIFT^^ variant. We have extracted these features by using the 'vl_dsift'
function for Matlab (The MathWorks, Inc., Natick, Massachusetts, USA) available in
the open-source VLFeat library*^, in its exact form, which according to the authors is
"roughly equivalent to running SIFT on a dense grid of locations at a fixed scale and 
orientation". The description method of DSIFT is similar to the one that SIFT uses: 
The keypoint descriptor is a histogram representation that combines local gradient 
orientations and magnitudes from a certain neighborhood around a keypoint. More 
precisely, the descriptor is in fact a 3D histogram of gradient location and orientation, 
where location is quantized into a 4 X 4 location grid and the gradient angle is 
quantized into 8 orientations, one for each of the cardinal directions'^. The resulting 
descriptor is a normalized vector with the dimension of 128 elements. The reason for 
which we have chosen to use SIFT descriptors instead of other visual features is that 
these descriptors are simple linear Gaussian derivatives which are more stable to 
typical LSM image perturbations, such as multiplicative or additive noise, than higher 
Gaussian derivatives or differential invariants. At the same time, the high dimension
of the SIFT descriptors (128 elements) is equivalent to a high potential for the
discriminative representation of image regions.

DSIFT-BOF: implementation and evaluation. The experiment that we conducted 
was aimed at correctly labeling 200 images collected by TPEF on fibrotic rat liver
samples by using a Bag-of-Features (BoF) framework based on DSIFT features,
DSIFT-BOF, which is schematically illustrated in Fig. 2. The complete set of 200 
images consists of five sub-sets of 40 images each, one sub-set for each of the five 
METAVIR fibrosis stages: Stage 0 to Stage 4. For training and verification purposes all 
images used in the study were labeled as corresponding to one of the five fibrosis stages 
based on the pathologist's evaluation of the MT stained samples that was performed 
before running the DSIFT-BOF experiment. Prior to running DSIFT-BOF, each of
the images was processed by Wiener filtering in order to compensate for the additive and
multiplicative noise which could affect local feature description and hence the results
of the method. We have implemented and evaluated DSIFT-BOF and performed the
image filtering in MATLAB release 2012b (The MathWorks, Inc., Natick,
Massachusetts, USA) equipped with the open-source VLFeat library^.

In one run of the algorithm, 10 of the 40 images of each sub-set are tested, 15 
random images of the remaining ones are used for vocabulary building and the other 
15 images from the sub-set are used for training. Training images are used as
ground truth. In order to test all the images in a sub-set, we run the DSIFT-BOF algorithm four
times, each time testing 10 different images, and using other random combinations of 
images in the sub-set for vocabulary building and training. This procedure is 
schematically illustrated in Supplementary Fig. 2. 
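
The per-run partitioning described above can be sketched as follows. This is only an illustrative reading of the procedure, not the original implementation: the function name `make_runs`, the image identifiers and the fixed random seed are assumptions introduced for the example.

```python
import random

def make_runs(images, n_runs=4, n_test=10, n_vocab=15, seed=0):
    """Split one 40-image sub-set into test / vocabulary / training images per run.

    Each run tests a different block of n_test images; the remaining images are
    shuffled and divided between vocabulary building and training.
    """
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    shuffled = list(images)
    rng.shuffle(shuffled)
    runs = []
    for r in range(n_runs):
        test = shuffled[r * n_test:(r + 1) * n_test]
        test_set = set(test)
        rest = [im for im in shuffled if im not in test_set]
        rng.shuffle(rest)              # new random vocabulary/training combination
        runs.append({
            "test": test,
            "vocab": rest[:n_vocab],
            "train": rest[n_vocab:],
        })
    return runs

# Hypothetical identifiers for the 40 images of one fibrosis stage.
subset = [f"stage0_img{i:02d}" for i in range(40)]
runs = make_runs(subset)
```

With the default arguments every image of the sub-set is tested exactly once over the four runs, and within a run the test, vocabulary and training images do not overlap.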

Further on we present the steps of DSIFT-BOF, referring to a single run. As detailed 
in the introductory section, a typical BoF strategy requires representing the training 
and test images as term vectors. A term vector represents a normalized histogram of 
the visual words in a dictionary (codebook) that are found in an image. Therefore, the 
first step in the DSIFT-BOF algorithm is to construct the codebook. During a run, 15
images from each of the 5 fibrosis image sub-sets are used for this purpose. From each of
these images the same number of features, which results from the size of the grid 
spacing (Supplementary Table 1), is extracted and added to the codebook feature space. 
Once all features are extracted from the 75 images that are used for vocabulary 
building, and the codebook feature space is fully populated, we run a square-error
partitioning method, k-means clustering^'-^^, for identifying the centroids (a.k.a. cluster
centers) of the codebook feature space. K-means clustering is a simple nonhierarchical
method that aims to partition n observations, in our case all the features
extracted from the images used for vocabulary building, into k regions in which each
observation belongs to the region with the nearest centroid. The centroids of the
codebook feature space are identified after running k-means clustering and will
represent the elements of the DSIFT-BOF codebook. After the codebook is built, we
calculate the term vectors of the training images of each sub-set, and add them to a
'training pool'. The 'training pool' thus comprises 75 term vectors, as each of the
five sub-sets contributes 15 term vectors to it. For classifying a test image we
calculate its term vector and employ a weighted k-Nearest Neighbor (k-NN) classifier^\.
In k-NN classification, an object is classified by a majority vote of its neighbors, with the
object being assigned to the class most common among its k nearest neighbors. As an
improvement to k-NN, a distance-weighted k-NN rule can be introduced with the
basic idea of weighting close neighbors more heavily, according to their distances to the 
query. The weight used in the frame of the weighted k-NN classifier that we employed 
is 1/D, where D is the Euclidean distance between the term vector of a tested image and 
the term vector of a nearest neighbor. More precisely, after calculating the term vector 
of the test image, we search its k nearest neighbors (NNs), in terms of Euclidean 
distance, in the training pool. In each classification scenario, k equals the number of 
training images associated to the class containing the lowest number of fibrosis stages 
(the classification scenarios and classes are presented in the 'Results' section). For 
example, in the case of the S01_S234 classification scenario, the number of training 
images corresponding to the non-significant fibrosis class (stages 0-1) is 30, and 45 for
the significant fibrosis class (stages 2-4), thus in this particular case k = 30; similarly,
for the S0_S1234 classification scenario k = 15. Next, a classification is assigned to
the tested image if a minimum of C NNs out of its total of k NNs belong to a specific 
class, and their cumulated weight is higher than the cumulated weight of the NNs 
belonging to the other class(es). If these conditions apply the tested image is classified 
as belonging to the same class as these minimum C dominant NNs. This weighted k-NN
classification procedure is applied for every test image.
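
A minimal sketch of the weighted k-NN decision rule described above is given below. It is not the original Matlab code. The function name `weighted_knn` and the toy 2D term vectors are hypothetical. Two further points are assumptions: the guard against a zero distance, and the reading that a class must outweigh the combined weight of all other neighbors (for two classes this coincides with a pairwise comparison).

```python
import math
from collections import defaultdict

def weighted_knn(query, pool, k, c_min):
    """Classify `query` against `pool`, a list of (term_vector, label) pairs.

    Returns the winning label, or None when no class satisfies both conditions:
    at least c_min of the k nearest neighbors, and a cumulated 1/D weight larger
    than that of all remaining neighbors together.
    """
    neighbors = sorted((math.dist(query, vec), label) for vec, label in pool)[:k]
    votes = defaultdict(int)
    weights = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1
        weights[label] += 1.0 / max(dist, 1e-12)   # assumption: avoid division by zero
    total = sum(weights.values())
    for label in votes:
        if votes[label] >= c_min and weights[label] > total - weights[label]:
            return label
    return None

# Toy training pool with 2D "term vectors" (real term vectors have codebook-size entries).
pool = [((0.0, 0.0), "A"), ((0.2, 0.0), "A"), ((0.0, 0.3), "A"),
        ((1.0, 1.0), "B"), ((0.9, 1.0), "B")]
label = weighted_knn((0.1, 0.1), pool, k=5, c_min=3)
```

Sweeping `c_min` from 1 to `k` and recording the resulting decisions is what produces the different operating points of a Precision-Recall curve.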

As previously mentioned, in order for all the images in the set to be tested, four 
consecutive runs of DSIFT-BOF were needed. At the end of each of the four runs we 
have evaluated the assigned classifications for the tested images. We did this by 
comparing the results with the ground truth information provided by the histo- 
pathologist as a priori information. Performing this comparison yielded a number of 
True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives 
(FN), for each of the runs (e.g. #TP_run1, #TP_run2, #TP_run3, #TP_run4). After all
four runs were completed, and all the images in the sub-set had been tested, the results
were merged by summing (e.g. #TP = #TP_run1 + #TP_run2 + #TP_run3 +
#TP_run4). The resulting #TP, #FP, #TN, #FN values were used for calculating the
Precision and Recall, explained in detail in the Results section. The Precision-Recall
curves are generated by calculating the Precision and Recall for different values of C, the
minimal number of dominant NNs required for assigning a classification. More 
precisely, for generating PR curves, C was varied between 1 and k. 
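
The merging of the per-run counts can be sketched as follows. The sketch assumes the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN); the function names and the counts are illustrative only and are not results of the study.

```python
def merge_counts(runs):
    """Sum the TP, FP, TN and FN counts obtained in the individual runs."""
    return {key: sum(run[key] for run in runs) for key in ("TP", "FP", "TN", "FN")}

def precision_recall(counts):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up counts for four runs, for one value of C.
per_run = [
    {"TP": 8, "FP": 2, "TN": 9, "FN": 1},
    {"TP": 7, "FP": 1, "TN": 10, "FN": 2},
    {"TP": 9, "FP": 3, "TN": 7, "FN": 1},
    {"TP": 6, "FP": 2, "TN": 9, "FN": 3},
]
merged = merge_counts(per_run)
precision, recall = precision_recall(merged)
```

Repeating this computation for every value of C between 1 and k yields one (Recall, Precision) point per value, which together trace a PR curve.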

DSIFT-BOF: parameters. The three DSIFT-BOF parameters that are analyzed in the 
presented experiment are: grid spacing, bin size and codebook size. These parameters 
and their influence on the DSIFT-BOF outputs are presented in detail in the
'Results' section. The Matlab 'vl_dsift' function of the VLFeat open-source platform 
allows modifying the values for grid spacing and bin size through the following 
options that it accepts: 

• 'Step': This option controls the sampling density, which is the horizontal and 
vertical displacement of each feature center to the next ('grid spacing'). 

• 'Size': This option controls the scale of the extracted descriptors, i.e. the width in 
pixels of a spatial bin ('bin size'). 
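
To illustrate how the two options interact, the sketch below counts the descriptor centers that fit along one image axis. It uses a simplified model that is an assumption of this example and not necessarily the exact placement rule of 'vl_dsift': each descriptor covers 4 spatial bins of 'Size' pixels, and only descriptors lying entirely inside the image are kept. The function name `grid_centers` and the step values are hypothetical.

```python
def grid_centers(length, step, bin_size, n_bins=4):
    """Return descriptor center positions along one axis of `length` pixels.

    Simplified model (assumption): a descriptor spans n_bins * bin_size pixels
    and must lie fully inside the image; centers are spaced `step` pixels apart.
    """
    half = n_bins * bin_size / 2      # half-width of the descriptor support
    centers = []
    pos = half
    while pos + half <= length:
        centers.append(pos)
        pos += step
    return centers

# A 1024-pixel axis, as in the TPEF images; step values are illustrative.
dense = grid_centers(1024, step=8, bin_size=8)
sparse = grid_centers(1024, step=16, bin_size=2)
features_per_image = len(dense) ** 2  # same grid in x and y
```

Under this model a smaller step increases the number of features per image quadratically, while a larger bin size enlarges the patch that contributes to each descriptor and slightly reduces the number of positions that fit.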

The codebook size, which corresponds also to the dimension of the term vectors, is 
configured in the clustering stage of the DSIFT-BOF algorithm (Fig. 2). As previously 
mentioned, we have employed the k-means method for clustering, which we did by 
using the fast C implementation of k-means with Matlab interface, VGG K-means'^, 
which can handle large matrices. In this implementation the number of
clusters can be configured through the option 'nclus'. The VLFeat open-source
platform also contains an implementation of the k-means method, 'vl_kmeans'.
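
The role of the codebook size can be illustrated with a small pure-Python sketch of codebook construction and term vector computation. It uses plain Lloyd iterations as a stand-in for the VGG K-means or 'vl_kmeans' implementations and does not reproduce them. The function names, the toy 2D descriptors (instead of 128-dimensional ones) and the L1 normalization of the histogram are assumptions of this example.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids (the codebook elements)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:                  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def term_vector(descriptors, centroids):
    """Normalized histogram of visual-word occurrences (L1 normalization assumed)."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest(d, centroids)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy feature space: two groups of 2D descriptors.
rng = random.Random(1)
features = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
features += [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(50)]
codebook = kmeans(features, k=3)
tv = term_vector(features[:20], codebook)
```

The length of each term vector equals the number of clusters, so a larger codebook increases both the clustering cost and the dimension of every vector compared in the k-NN stage.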

Ethics statement. The Institutional Animal Care and Use Committee (IACUC)
approved all animal-related experiments. The reported methods were carried out in
accordance with the approved guidelines.